Abstract


Pronoun resolution is a major area of natural 
language understanding. However, large-scale 
training sets are still scarce, since manually
labelling data is costly. In this work, we introduce
WIKICREM (Wikipedia CoREferences
Masked), a large-scale, yet accurate dataset
of pronoun disambiguation instances. We 
use a language-model-based approach for pro- 
noun resolution in combination with our WI- 
KICREM dataset. We compare a series of 
models on a collection of diverse and challeng- 
ing coreference resolution problems, where 
we match or outperform previous state-of-the- 
art approaches on 6 out of 7 datasets, such 
as GAP, DPR, WNLI, PDP, WINOBIAS, and 
WINOGENDER. We release our model to be 
used off-the-shelf for solving pronoun disam- 
biguation. 


1 Introduction 


Pronoun resolution, also called coreference or ana- 
phora resolution, is a natural language processing 
(NLP) task, which aims to link the pronouns with 
their referents. This task is of crucial importance 
in various other NLP tasks, such as information 
extraction (Nakayama, 2019) and machine trans- 
lation (Guillou, 2012). Due to its importance, pro- 
noun resolution has seen a series of different ap- 
proaches, such as rule-based systems (Lee et al., 
2013) and end-to-end-trained neural models (Lee 
et al., 2017; Liu et al., 2019). However, the re- 
cently released dataset GAP (Webster et al., 2018) 
shows that most of these solutions perform worse 
than naive baselines when the answer cannot be 
deduced from the syntax. Addressing this draw- 
back is difficult, partially due to the lack of large- 
scale challenging datasets needed to train the data- 
hungry neural models. 

As observed by Trinh and Le (2018), language 
models are a natural approach to pronoun resolution:
they select the replacement for a pronoun
that forms the sentence with the highest probability.
Additionally, language models have the advantage 
of being pre-trained on a large collection of un- 
structured text and then fine-tuned on a specific 
task using much less training data. This proce- 
dure has obtained state-of-the-art results on a se- 
ries of natural language understanding tasks (De- 
vlin et al., 2018). 

In this work, we address the lack of large train- 
ing sets for pronoun disambiguation by introduc- 
ing a large dataset that can be easily extended. 
To generate this dataset, we find passages of text 
where a personal name appears at least twice and 
mask one of its non-first occurrences. To make 
the disambiguation task more challenging, we also 
ensure that at least one other distinct personal 
name is present in the text in a position before the 
masked occurrence. We instantiate our method on 
English Wikipedia and generate the Wikipedia Co- 
REferences Masked (WIKICREM) dataset with 
2.4M examples, which we make publicly available 
for further usage¹. We show its value by using it to
fine-tune the BERT language model (Devlin et al., 
2018) for pronoun resolution. 

To show the usefulness of our dataset, we train 
several models that cover three real-world sce- 
narios: (1) when the target data distribution is 
completely unknown, (2) when training data from 
the target distribution is available, and (3) the 
transductive scenario, where the unlabeled test 
data is available at the training time. We show 
that fine-tuning BERT with WIKICREM consis- 
tently improves the model in each of the three 
scenarios, when evaluated on a collection of 7 
datasets. For example, we outperform the state-of-the-art
approaches on GAP (Webster et al., 2018),
DPR (Rahman and Ng, 2012), and PDP (Davis
et al., 2017) by 5.9%, 8.4%, and 12.7%, respectively.
Additionally, models trained with WIKICREM
show increased performance and reduced
bias on the gender diagnostic datasets WINOGENDER
(Rudinger et al., 2018) and WINOBIAS (Zhao
et al., 2018).

¹The code can be found at https://github.com/


2 Related Work 


There are several large and commonly used benchmarks
for coreference resolution (Pradhan
et al., 2012; Schafer et al., 2012; Ghaddar and
Langlais, 2016). However, Webster et al. (2018)
argue that a high performance on these datasets 
does not correlate with a high accuracy in practice, 
because examples where the answer cannot be de- 
duced from the syntax (we refer to them as hard 
pronoun resolution) are underrepresented. There- 
fore, several hard pronoun resolution datasets have 
been introduced (Webster et al., 2018; Rahman 
and Ng, 2012; Rudinger et al., 2018; Davis et al., 
2017; Zhao et al., 2018; Emami et al., 2019). 
However, they are all relatively small, often cre- 
ated only as a test set. 

Therefore, most of the pronoun resolution mod- 
els that address hard pronoun resolution rely on 
little (Liu et al., 2019) or no training data, via
unsupervised pre-training (Trinh and Le, 2018;
Radford et al., 2019). Another approach involves
using external knowledge bases (Emami et al., 2018;
Fahndrich et al., 2018); however, the accuracy of
these models still lags behind that of the
aforementioned pre-trained models.

A similar approach to ours for unsupervised 
data generation and language-model-based eval- 
uation has been recently presented in our pre- 
vious work (Kocijan et al., 2019). We gener- 
ated MASKEDWIKI, a large unsupervised dataset 
created by searching for repeated occurrences of 
nouns. However, training on MASKEDWIKI on its 
own is not always enough and sometimes makes 
a difference only in combination with additional 
training on the DPR dataset (called WSCR) (Rah- 
man and Ng, 2012). In contrast, WIKICREM 
brings a much more consistent improvement over 
a wider range of datasets, strongly improving 
models’ performance even when they are not fine- 
tuned on additional data. As opposed to our previ- 
ous work (Kocijan et al., 2019), we evaluate mod- 
els on a larger collection of test sets, showing the 


usefulness of WIKICREM beyond the Winograd 
Schema Challenge. 

Moreover, a manual comparison of WIKI- 
CREM and MASKEDWIKI (Kocijan et al., 2019) 
shows a significant difference in the quality of 
the examples. We annotated 100 random exam- 
ples from MASKEDWIKI and WIKICREM. In 
MASKEDWIKI, we looked for examples where 
masked nouns can be replaced with a pronoun, 
and only in 7 examples, we obtained a natural- 
sounding and grammatically correct sentence. In 
contrast, we estimated that 63% of the anno- 
tated examples in WIKICREM form a natural- 
sounding sentence when the appropriate pronoun 
is inserted, showing that WIKICREM consists of 
examples that are much closer to the target data. 
We highlight that pronouns are not actually in- 
serted into the sentences and thus none of the ex- 
amples sound unnatural. This analysis was per- 
formed to show that WIKICREM consists of ex- 
amples with data distribution closer to the target 
tasks than MASKEDWIKI. 


3 The WIKICREM Dataset 


In this section, we describe how we obtained
WIKICREM. Starting from English Wikipedia, we
search for sentences and pairs of sentences with 
the following properties: at least two distinct per- 
sonal names appear in the text, and one of them is 
repeated. To keep the examples concise, we do not
use passages longer than two sentences.
Personal names in the text are called “candidates”.
One non-first occurrence of the repeated
candidate is masked, and the goal is to predict the 
masked name, given the correct and one incorrect 
candidate. In case of more than one incorrect can- 
didate in the sentence, several datapoints are con- 
structed, one for each incorrect candidate. 

We ensure that the alternative candidate appears 
before the masked-out name in the text, in order 
to avoid trivial examples. Thus, the example is 
retained in the dataset if: 

(a) the repeated name appears after both candi- 
dates, all in a single sentence; or 
(b) both candidates appear in a single sentence, 
and the repeated name appears in a sentence 
directly following. 
Examples where one of the candidates appears in
the same sentence as the repeated name, while the
other candidate does not, are discarded, as they are
often too trivial.
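
The retention rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build WIKICREM: it assumes the passage is already split into tokenized sentences and that an upstream NER step has produced a set of single-token personal names. The function name `make_examples` and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Set

MASK = "[MASK]"


def make_examples(sentences: List[List[str]], names: Set[str]) -> List[Dict]:
    """Build masked-name datapoints from a passage of one or two sentences.

    `sentences` is a list of token lists; `names` contains the single-token
    personal names that an upstream NER step is assumed to have found.
    """
    if not 1 <= len(sentences) <= 2:
        return []  # longer passages are not used

    # Flatten the passage, remembering the sentence index of every token.
    tokens = [tok for sent in sentences for tok in sent]
    sent_of = [s for s, sent in enumerate(sentences) for _ in sent]

    for pos, tok in enumerate(tokens):
        if tok not in names or tok not in tokens[:pos]:
            continue  # only non-first occurrences of a name can be masked
        mask_sent = sent_of[pos]

        def seen_in_mask_sentence(name: str) -> bool:
            # True if `name` occurs before the mask, in the mask's sentence.
            return any(tokens[i] == name and sent_of[i] == mask_sent
                       for i in range(pos))

        masked = tokens[:pos] + [MASK] + tokens[pos + 1:]
        earlier_names = (t for t in tokens[:pos] if t in names and t != tok)
        examples = []
        # One datapoint per distinct incorrect candidate before the mask.
        for other in dict.fromkeys(earlier_names):
            # Keep the pair if both candidates are in the mask sentence
            # (case a) or both are only in the preceding sentence (case b).
            if seen_in_mask_sentence(tok) == seen_in_mask_sentence(other):
                examples.append({"text": " ".join(masked),
                                 "correct": tok, "incorrect": other})
        if examples:
            return examples  # a single occurrence is masked per passage
    return []
```

For a one-sentence passage that mentions "Adams", then "Powell", then "Adams" again, the sketch masks the second "Adams" and returns one datapoint with "Powell" as the incorrect candidate.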

We illustrate the procedure with the following 
example: 


When asked about Adams’ report, Powell found 
many of the statements to be inaccurate, including 
a claim that Adams first surveyed an area that was 
surveyed in 1857 by Joseph C. 

The second occurrence of “Adams” is masked. 
The goal is to determine which of the two candidates
(“Adams”, “Powell”) has been masked out.
The masking process resembles replacing a name 
with a pronoun, but the pronoun is not inserted to 
keep the process fully unsupervised and error-free. 

We used the spaCy Named Entity Recognition
library to find the occurrences of names in the
text. The resulting dataset consists of 2,438,897
samples; 10,000 examples are held out to serve as
the validation set. Two examples from our dataset
can be found in Figure 1.


Gina arrives and she is furious with Denise for not 
protecting Jody from Kingsley, as [MASK] was 
meant to be the parent. 

Candidates: Gina, Denise 


When Ashley falls pregnant with Victor’s child, 
Nikki is diagnosed with cancer, causing Victor to 
leave [MASK], who secretly has an abortion. 
Candidates: Ashley, Nikki 


Figure 1: WIKICREM examples. Correct answers are 
given in bold. 


We note that our dataset contains hard exam- 
ples. To resolve the first example, one needs to 
understand that Denise was assigned a task and 
“meant to be the parent” thus refers to her. To 
resolve the second example, one needs to under- 
stand that having an abortion can only happen if 
one falls pregnant first. Since both candidates 
have feminine names, the answer cannot be deduced
just from the common co-occurrence of female
names and the word “abortion”.

We highlight that our example generation 
method, while having the advantage of being un- 
supervised, also does not give incorrect signals, 
since we know the ground truth reference. 

Even though WIKICREM and GAP both use 
text from English Wikipedia, they produce differing
examples, because their generating processes
differ. In GAP, passages with pronouns
are collected and the pronouns are manually annotated,
while WIKICREM is generated by masking
names that appear in the text. Even if the same
text is used, a different masking process results
in different inputs and outputs, making the examples
different under the transductive hypothesis.


WIKICREM statistics. We analyze our dataset 
for gender bias. We use the Gender guesser library
to determine the gender of the candidates.
To mimic the analysis of pronoun genders per- 
formed in the related works (Webster et al., 2018; 
Rudinger et al., 2018; Zhao et al., 2018), we ob- 
serve the gender of the correct candidates only. 
There were 0.8M “male” or “mostly_male” names
and 0.42M “female” or “mostly_female” names;
the rest were classified as “unknown”. The ratio
of female to male candidates is thus estimated
at around 0.53, in favour of male candidates.
We will see that this gender imbalance does not 
have any negative impact on bias, as shown in Sec- 
tion 6.2. 
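
The ratio quoted above follows directly from the label counts. The sketch below shows the computation, assuming each correct candidate has already been assigned one of the labels named in the text; the function name `female_to_male_ratio` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def female_to_male_ratio(labels: Iterable[str]) -> float:
    """Ratio of female-labelled to male-labelled candidates.

    Labels other than the four gendered ones (for example "unknown")
    are ignored, mirroring the analysis described in the text.
    """
    counts = Counter(labels)
    male = counts["male"] + counts["mostly_male"]
    female = counts["female"] + counts["mostly_female"]
    return female / male
```

With the rounded counts reported in the text, 0.42M female and 0.8M male names, the ratio is 0.42 / 0.8 = 0.525, which is reported as roughly 0.53.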

However, our unsupervised generating procedure
sometimes yields examples where the correct
answer cannot be deduced given the available
information; we refer to these as unsolvable examples.
To estimate the percentage of unsolvable
examples, we manually annotated 100 randomly 
selected examples from the WIKICREM dataset. 
In order to prevent guessing, the candidates were 
not visible to the annotators. For each example, 
we asked them to state whether it was solvable or 
not, and to answer the solvable examples. In 100 
examples, we found 18 unsolvable examples and 
achieved 95.1% accuracy on the rest, showing that 
the annotation error rate is tolerable. These anno- 
tations can be found in Appendix A. 

However, as shown in Section 6.2, training on 
WIKICREM alone does not match the perfor- 
mance of training on the data from the target dis- 
tribution. The data distribution of WIKICREM 
differs from the data distribution of the datasets 
for evaluation. If we replace the [MASK] token 
with a pronoun instead of the correct candidate, 
the resulting sentence sometimes sounds unnatural 
and would not occur in a human-written text. On 
the annotated 100 examples, we estimated the per- 
centage of natural-sounding sentences to be 63%. 
While these sentences are not incorrect, the
distribution of the training data differs from that
of the target data.


4 Model 


We use a simple language-model-based approach 
to anaphora resolution to show the value of the in- 
troduced dataset. In this section, we first introduce 
BERT (Devlin et al., 2018), a language model that 
we use throughout this work. In the second part, 
we describe the utilization of BERT and the fine- 
tuning procedures employed. 


4.1 BERT 


The Bidirectional Encoder Representations from 
Transformers (BERT) language model is based on 
the transformer architecture (Vaswani et al., 2017). 
We choose this model due to its strong language- 
modeling abilities and high performance on sev- 
eral NLU tasks (Devlin et al., 2018). 

BERT is initially trained on two tasks: next 
sentence prediction and masked token prediction. 
In the next sentence prediction task, the model 
is given two sentences and is asked to predict 
whether the second sentence follows the first one. 
In the masked token prediction task, the model is 
given text with approximately 15% of the input 
tokens masked, and it is asked to predict these to- 
kens. The details of the pre-training procedure can 
be found in Devlin et al. (2018). 

In this work, we only focus on the masked token 
prediction. We use the PyTorch implementation of
BERT⁵ and the pre-trained weights for BERT-large
released by Devlin et al. (2018). 


4.2 Pronoun Resolution with BERT 


This section introduces the procedure for pronoun 
resolution used throughout this work. Let S be 
the sentence with a pronoun that has to be re- 
solved. Let a be a candidate for pronoun res- 
olution. The pronoun in S is replaced with a 
[MASK] token and used as the input to the model 
to compute the log-probability log P(a|S). If a 
consists of more than one token, the same number 
of [MASK] tokens is inserted into S, and the log- 
probability log P(a|S) is computed as the average 
of log-probabilities of all tokens in a. 

⁵https://github.com/huggingface/pytorch-pretrained-BERT

The candidate-finding procedures are dataset-specific
and are described in Section 6. Given
a sentence $S$ and several candidates $a_1, \ldots, a_n$,
we select the candidate $a_i$ with the largest
$\log P(a_i \mid S)$.
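
The scoring and selection rule of this section can be sketched as follows. The masked language model is abstracted behind a hypothetical callable `token_log_probs`, which is assumed to return one log-probability per candidate token at the corresponding [MASK] position; with BERT this would wrap a forward pass of the masked-token head. The names `candidate_score` and `resolve` are illustrative only.

```python
from typing import Callable, List

MASK = "[MASK]"

# Assumed interface: (masked sentence tokens, candidate tokens) ->
# one log-probability per candidate token.
TokenScorer = Callable[[List[str], List[str]], List[float]]


def candidate_score(sentence: List[str], pronoun_index: int,
                    candidate: List[str], token_log_probs: TokenScorer) -> float:
    """Average log-probability of the candidate's tokens at the masked slot."""
    # Replace the pronoun with as many [MASK] tokens as the candidate has.
    masked = (sentence[:pronoun_index]
              + [MASK] * len(candidate)
              + sentence[pronoun_index + 1:])
    log_probs = token_log_probs(masked, candidate)
    return sum(log_probs) / len(log_probs)


def resolve(sentence: List[str], pronoun_index: int,
            candidates: List[List[str]], token_log_probs: TokenScorer) -> List[str]:
    """Return the candidate with the largest averaged log-probability."""
    return max(candidates,
               key=lambda c: candidate_score(sentence, pronoun_index, c,
                                             token_log_probs))
```

Averaging rather than summing the token log-probabilities keeps candidates with many word pieces from being penalized purely for their length.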


4.3 Training 


When training the model, the setup is similar to 
testing. We are given a sentence with a name or 
a pronoun masked out, together with two candi- 
dates. The goal is to determine which of the candi- 
dates is a better fit. Let a be the correct candidate, 
and b be an incorrect candidate. Following our
previous work (Kocijan et al., 2019), we minimize
the negative log-likelihood of the correct candi- 
date, while additionally imposing a max-margin 
between the log-likelihood of the correct and in- 
correct terms. We observe that this combined loss 
consistently yields better results on validation sets 
of all experiments than negative log-likelihood or 
max-margin loss on their own. 


$$\mathcal{L} = -\log P(a \mid S) + \alpha \cdot \max\bigl(0,\ \log P(b \mid S) - \log P(a \mid S) + \beta\bigr), \tag{1}$$

where $\alpha$ and $\beta$ are hyperparameters controlling
the influence of the max-margin loss term and the
margin between the log-likelihood of the correct
and incorrect candidates, respectively.
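
As a small worked illustration of Equation (1), the loss for a single example can be written as a plain function of the two candidate log-probabilities. The function name `combined_loss` is hypothetical, and the default values of the two hyperparameters are the ones this section reports as best on the validation set.

```python
def combined_loss(log_p_correct: float, log_p_incorrect: float,
                  alpha: float = 10.0, beta: float = 0.2) -> float:
    """Negative log-likelihood of the correct candidate plus a max-margin term."""
    nll = -log_p_correct
    # The margin term is zero once the correct candidate beats the incorrect
    # one by at least `beta` in log-probability.
    margin = max(0.0, log_p_incorrect - log_p_correct + beta)
    return nll + alpha * margin
```

For instance, with log P(a|S) = -1.0 and log P(b|S) = -2.0 the margin term vanishes and the loss is 1.0, whereas with log P(b|S) = -0.9 the margin term contributes 10 · 0.3, giving a loss of about 4.0.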

The hyperparameter settings for fine-tuning
BERT on WIKICREM were the same as those of Devlin
et al. (2018), except for the learning rate and the
introduced constants $\alpha$ and $\beta$. For our
hyperparameter search, we used a grid search over the learning rate
$lr \in \{3 \cdot 10^{-5}, 1 \cdot 10^{-5}, 5 \cdot 10^{-6}, 3 \cdot 10^{-6}\}$
and the hyperparameters $\alpha \in \{5, 10, 20\}$ and $\beta \in \{0.1, 0.2, 0.4\}$.
The hyperparameter search is performed
on a subset of WIKICREM with $10^5$ datapoints
to reduce the searching time. We compare the
influence of hyperparameters on the validation set
of the WIKICREM dataset. The best validation score
was achieved with $lr = 1 \cdot 10^{-5}$, $\alpha = 10$, and
$\beta = 0.2$. We used batches of size 64.
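
The search space described above is small enough to enumerate explicitly. The sketch below lists the grid and picks the configuration with the best validation score; `validation_score` is a hypothetical callable standing in for a full fine-tuning and validation run, and the names `grid` and `best_config` are illustrative.

```python
from itertools import product
from typing import Callable, Dict, List

LEARNING_RATES = [3e-5, 1e-5, 5e-6, 3e-6]
ALPHAS = [5, 10, 20]
BETAS = [0.1, 0.2, 0.4]


def grid() -> List[Dict[str, float]]:
    """All combinations of learning rate, alpha, and beta."""
    return [{"lr": lr, "alpha": a, "beta": b}
            for lr, a, b in product(LEARNING_RATES, ALPHAS, BETAS)]


def best_config(validation_score: Callable[[Dict[str, float]], float]) -> Dict[str, float]:
    """Return the configuration with the highest validation score."""
    return max(grid(), key=validation_score)
```

The grid contains 4 · 3 · 3 = 36 configurations, which is one reason for running the search on a subset of the data.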

Since WIKICREM is large and one epoch takes 
around two days even when parallelized on 8 Tesla 
P100 GPUs, we only fine-tune BERT on WIKI- 
CREM for a single epoch. We note that better 
results may be achieved with further fine-tuning 
and improved hyperparameter search. 

Fine-tuning on other datasets is performed in 
the same way as training except for two differ- 
ences. Firstly, in fine-tuning, the model is trained 
for 30 epochs due to the smaller size of datasets. 
Secondly, we do not sub-sample the training set 
for hyperparameter search. We validate the model
after every epoch, retaining the model that performs
best on the WIKICREM validation set.


5 Evaluation Datasets 


We now introduce the 7 datasets that were used 
to evaluate the models. We decided not to use the
CONLL2012 and WINOCOREF (Pradhan et al., 
2012; Peng et al., 2015) datasets, because they 
contain more general coreference examples than 
just pronouns. We did not evaluate on the KNOW- 
REF dataset (Emami et al., 2019), since it was not 
yet publicly available at the time of writing. 


GAP. GAP (Webster et al., 2018) is a collection 
of 4,454 passages from Wikipedia containing ambiguous
pronouns. It focuses on the resolution of
personal pronouns referring to human names and 
has a 1:1 ratio between masculine and feminine
pronouns. In addition to the overall performance
on the dataset, each model is evaluated also on its
performance on the masculine subset ($F_1^M$), feminine
subset ($F_1^F$), and its gender bias ($F_1^F / F_1^M$). The
best performance was exhibited by the Referen- 
tial Reader (Liu et al., 2019), a GRU-based model 
with additional external memory cells. 

For each example, two candidates are given 
with the goal of determining whether they are the 
referent. In approximately 10% of the training ex- 
amples, none of the candidates are correct. When 
training on the GAP dataset, we discard such ex- 
amples from the training set. We do not discard 
any examples from the validation or test set. 

When testing the model, we use the Spacy NER 
library to find all candidates in the sentence. Since 
the GAP dataset mainly contains examples with 
human names, we only retain named entities with 
the tag PERSON. We observe that in 18.5% of the 
test samples, the Spacy NER library fails to ex- 
tract the candidate in question, making the answer 
for that candidate “FALSE” by default, putting our
models at a disadvantage. Because of this, 7.25% of
answers are always false negatives, and 11.25%
are always true negatives, regardless of the model.
Taking this into account, we compute that the
maximal $F_1$-score achievable by our models is
capped at 91.1%.

We highlight that, when evaluating our models, 
we are stricter than previous approaches (Liu et al., 
2019; Webster et al., 2018). While they count the 
answer as “correct” if the model returns a sub- 
string of the correct answer, we only accept the 


full answer. The aforementioned models return the 
exact location of the correct candidate in the input 
sentence, while our approach does not. This strict- 
ness is necessary, because a substring of a correct 
answer could be a substring of several answers at 
once, making it ambiguous. 


WSC. The Winograd Schema Challenge (Levesque
et al., 2011) is a hard pronoun resolution
challenge inspired by the example from Wino- 
grad (1972): 


The city councilmen refused the demonstrators a 
permit because they [feared/advocated] violence. 
Question: Who [feared/advocated] violence? 

Answer: the city councilmen / the demonstrators 


A change of a single word in the sentence 
changes the referent of the pronoun, making it 
very hard to resolve. An example of a Winograd 
Schema must meet the following criteria (Leves- 
que et al., 2011): 

1. Two entities appear in the text. 

2. A pronoun or a possessive adjective appears 
in the sentence and refers to one of the enti- 
ties. It would be grammatically correct if it 
referred to the other entity. 

3. The goal is to find the referent of the pronoun 
or possessive adjective. 

4. The text contains a “special word”. When 
switched for the “alternative word”, the sen- 
tence remains grammatically correct, but the 
referent of the pronoun changes. 

The Winograd Schema Challenge is specifically 
made up from challenging examples that require 
commonsense reasoning for resolution and should 
not be solvable with statistical analysis of
co-occurrence and association.

We evaluate the models on the collection of 
273 problems used for the 2016 Winograd Schema 
Challenge (Davis et al., 2017), also known as 
WSC273. The best known approach to this problem
uses the BERT language model, fine-tuned on
the DPR dataset (Kocijan et al., 2019). 


DPR. The Definite Pronoun Resolution (DPR) 
corpus (Rahman and Ng, 2012) is a collection 
of problems that resemble the Winograd Schema 
Challenge. The criteria for this dataset have 
been relaxed, and it contains examples that might 
not require commonsense reasoning or examples 
where the “special word” is actually a whole 
phrase. We remove 6 examples in the DPR training
set that overlap with the WSC dataset. The
dataset was constructed manually and consists of
1316 training and 564 test samples after we re- 
moved the overlapping examples. The best result 
on the dataset was reported by Peng et al. (2015) 
using external knowledge sources and integer lin- 
ear programming. 


PDP. The Pronoun Disambiguation Problem
(PDP) is a small collection of 60 problems that was
used as the first round of the Winograd Schema
Challenge in 2016 (Davis et al., 2017). Unlike
WSC, the examples do not contain a “special
word”; however, they do require commonsense
reasoning to be answered. The examples were 
manually collected from books. Despite its small 
size, there have been several attempts at solving 
this challenge (Fahndrich et al., 2018; Trinh and 
Le, 2018), the best result being held by the Marker 
Passing algorithm (Fahndrich et al., 2018). 


WNLI. The Winograd Natural Language Infer- 
ence (WNLI) is an inference task inspired by the 
Winograd Schema Challenge and is one of the 9 
tasks on the GLUE benchmark (Wang et al., 2019). 
WNLI examples are obtained by rephrasing Wino- 
grad Schemas. The Winograd Schema is given as 
the “premise”. A “hypothesis” is constructed by 
repeating the part of the premise with the pronoun 
and replacing the pronoun with one of the candi- 
dates. The goal is to classify whether the hypoth- 
esis follows from the premise. 

A WNLI example obtained by rephrasing one of 
the WSC examples looks like this: 


Premise: The city councilmen refused the demon- 
strators a permit because they feared violence. 
Hypothesis: The demonstrators feared violence. 
Answer: true / false 


The WNLI dataset is constructed manually. 
Since the WNLI training and validation sets over- 
lap with WSC, we use the WNLI test set only. The 
test set of WNLI comes from a separate source and 
does not overlap with any other dataset. 

The currently best approach transforms exam- 
ples back into the Winograd Schemas and solves 
them as a coreference problem (Kocijan et al., 
2019). Following our previous work (Kocijan 
et al., 2019), we reverse the process of example 
generation in the same way. We automatically de- 
tect which part of the premise has been copied to 
construct the hypothesis. This locates the pronoun 
that has to be resolved, and the candidate in ques- 
tion. All other nouns in the premise are treated 


as alternative candidates. We find nouns in the 
premise with the Stanford POS tagger (Manning 
et al., 2014). 
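
The reversal of a WNLI example into a pronoun resolution problem can be illustrated with a simple heuristic. The text does not spell out the exact detection procedure, so the sketch below is an assumption: it supposes that the hypothesis consists of the candidate followed by a clause copied from the premise, finds the longest hypothesis suffix that also occurs in the premise, and treats the premise token just before that clause as the pronoun. The function name `locate_pronoun` and the returned keys are hypothetical.

```python
from typing import Dict, Optional, Union


def locate_pronoun(premise: str, hypothesis: str) -> Optional[Dict[str, Union[str, int]]]:
    """Recover the pronoun position and the queried candidate from a WNLI pair.

    Simplifying assumption: the hypothesis is "<candidate> <copied clause>",
    and the copied clause directly follows the pronoun in the premise.
    """
    p = premise.rstrip(".").split()
    h = hypothesis.rstrip(".").split()
    p_low = [t.lower() for t in p]
    h_low = [t.lower() for t in h]

    # Try the longest hypothesis suffix first, always leaving at least one
    # token at the front of the hypothesis for the candidate.
    for start in range(1, len(h)):
        suffix = h_low[start:]
        n = len(suffix)
        # Start at 1 so that a token (the pronoun) precedes the match.
        for i in range(1, len(p) - n + 1):
            if p_low[i:i + n] == suffix:
                return {"pronoun": p[i - 1],
                        "pronoun_index": i - 1,
                        "candidate": " ".join(h[:start])}
    return None
```

On the example shown above, the copied clause is "feared violence", the pronoun is "they", and the candidate in question is "The demonstrators". Real WNLI pairs are more varied, so a practical implementation would need a more tolerant alignment.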


WINOGENDER. WINOGENDER (Rudinger 
et al., 2018) is a dataset that follows the WSC 
format and aims to measure gender bias.
One of the candidates is always an occupation,
while the other is a participant, both selected
to be gender neutral. Examples intentionally
contain occupations with strong imbalance in the
gender ratio. The participant can be replaced with the
neutral “someone”, and three different pronouns 
(he/she/they) can be used. The aim of this dataset 
is to measure how the change of the pronoun 
gender affects the accuracy of the model. 


Our models mask the pronoun and are thus not 
affected by the pronoun gender. They exhibit no 
bias on this dataset by design. We mainly use this 
dataset to measure the accuracy of different mod- 
els on the entire dataset. According to Rudinger 
et al. (2018), the best performance is exhibited by 
Durrett and Klein (2013) when used on the male 
subset of the dataset. We use this result as the 
baseline. 


WINOBIAS. Similarly to the WINOGENDER 
dataset, WINOBIAS (Zhao et al., 2018) is a WSC- 
inspired dataset that measures gender bias in the 
coreference resolution algorithms. Like
WINOGENDER, it contains instances of occupations
with high gender imbalance. It contains
3,160 examples of Winograd Schemas, equally
split into validation and test sets. The test set
examples are split into 2 types, where examples of type
1 are “harder” and should not be solvable using the 
analysis of co-occurrence, and examples of type 2 
are easier. Additionally, each of these subsets is 
split into anti-stereotypical and pro-stereotypical 
subsets, depending on whether the gender of the 
pronoun matches the most common gender in the 
occupation. The difference in performance be- 
tween pro- and anti-stereotypical examples shows 
how biased the model is. The best performance 
is exhibited by Lee et al. (2017) and Durrett and 
Klein (2013), as reported by Zhao et al. (2018). 


6 Evaluation 


We quantify the impact of WIKICREM on the
introduced datasets.


6.1 Experiments


We train several different models to evaluate the 
contribution of the WIKICREM dataset in different
real-world scenarios. In Scenario A, no information
about the target distribution is available. In
Scenario B, the distribution of the target data is 
known and a sample of training data from the tar- 
get distribution is available. Finally, Scenario C is 
the transductive scenario where the unlabeled test 
samples are known in advance. All evaluations on 
the GAP test-set are considered to be Scenario C, 
because BERT has been pre-trained on the English 
Wikipedia and has thus seen the text in the GAP 
dataset at the pre-training time. 
We describe the evaluated models below. 


BERT. This model, pretrained by Devlin 
et al. (2018), is the starting point for all models 
and serves as the soft baseline for Scenario A. 


BERT_WIKIRAND. This model serves as an 
additional baseline for Scenario A and aims to 
eliminate external factors that might have worked 
against the performance of BERT. To eliminate 
the effect of sentence lengths, loss function, and 
the percentage of masked tokens during the train- 
ing time, we generate the RANDOM WIKI dataset. 
It consists of random passages from Wikipedia 
and has the same sentence-length distribution and 
number of datapoints as WIKICREM. However, 
the masked-out word from the sentence is selected 
randomly, while the alternative candidate is se- 
lected randomly from the vocabulary. BERT is 
then trained on this dataset in the same way as 
BERT_WIKICREM, as described in Section 4.3. 


BERT_WIKICREM. BERT, additionally train- 
ed on WIKICREM. Its evaluation on non-GAP 
datasets serves as the evaluation of WIKICREM 
under Scenario A. 


BERT_DPR. BERT, fine-tuned on DPR. We hold 
out 10% of the DPR train set (131 examples) to 
use them as the validation set. All datasets, other 
than GAP, were inspired by the Winograd Schema 
Challenge and come from a similar distribution. 
We use this model as the baseline for Scenario B. 


BERT_WIKICREM_DPR. This model is obtained
by fine-tuning BERT_WIKICREM on DPR
using the same split as for BERT_DPR. It serves 
as the evaluation of WIKICREM under Sce- 
nario B. 


BERT_GAP_DPR. This model serves as an additional
comparison to the BERT_WIKICREM_DPR
model. It is obtained by fine-tuning BERT_GAP on
the DPR dataset. 


BERT_GAP. This model is obtained by fine- 
tuning BERT on the GAP dataset. It serves as the 
baseline for Scenario C, as explained at the begin- 
ning of Section 6.1. 


BERT_WIKICREM_GAP. This model serves 
as the evaluation of WIKICREM for Scenario C 
and is obtained by fine-tuning BERT_WIKICREM
on GAP. 


BERT_ALL. This model is obtained by fine- 
tuning BERT on all the available data from the 
target datasets at once. Combined GAP-train and 
DPR-train data are used for training. The model 
is validated on the GAP-validation set and the 
WINOBIAS-validation set separately. Scores on 
both sets are then averaged to obtain the validation 
performance. Since both training sets and both 
validation sets have roughly the same size, both 
tasks are represented equally. 


BERT_WIKICREM_ALL. This model is ob- 
tained in the same way as the BERT_ALL model, 
but starting from BERT_WIKICREM instead. 


6.2 Results 


The results of the evaluation of the models on the 
test sets are shown in Table 1. We notice that ad- 
ditional training on WIKICREM consistently im- 
proves the performance of the models in all sce- 
narios and on most tests. Due to the small size of 
some test sets, some of the results are subject to 
deviation. This especially applies to PDP (60 test 
samples) and WNLI (145 test samples). 

We observe that BERT_WIKIRAND generally
performs worse than BERT, with GAP and PDP being
notable exceptions. This shows that BERT is
a strong baseline and that the improved performance
of BERT_WIKICREM is not a consequence of
training on shorter sentences or with a different loss
function. BERT_WIKICREM consistently out- 
performs both baselines on all tests, showing that 
WIKICREM can be used as a standalone dataset. 

We observe that training on the data from the 
target distribution improves the performance the 
most. Models trained on GAP-train usually show 
more than a 20% increase in their $F_1$-score on
GAP-test. Still, BERT_WIKICREM_GAP shows




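Scenario labels from the original layout: no train data, existing train data, transductive scenario.

Model               GAP F1  GAP F1^F  GAP F1^M  Bias (F/M)  DPR    WSC    WNLI
-------------------------------------------------------------------------------
SOTA                72.1%   71.4%     72.8%     0.98        76.4%  72.5%  74.7%
-------------------------------------------------------------------------------
BERT                50.0%   47.2%     52.7%     0.90        59.8%  61.9%  65.8%
BERT_WIKIRAND       55.1%   51.8%     58.2%     0.89        59.2%  59.3%  65.8%
BERT_WIKICREM       59.0%   57.5%     60.5%     0.95        67.4%  63.4%  67.1%
-------------------------------------------------------------------------------
BERT_GAP            75.2%   75.1%     75.3%     1.00        66.8%  63.0%  68.5%
BERT_WIKICREM_GAP   77.4%   78.4%     76.4%     1.03        71.1%  64.1%  70.5%
-------------------------------------------------------------------------------
BERT_DPR            60.9%   61.3%     60.6%     1.01        83.3%  67.0%  71.9%
BERT_GAP_DPR        70.0%   70.4%     69.5%     1.01        79.4%  65.6%  72.6%
BERT_WIKICREM_DPR   64.2%   64.2%     64.1%     1.00        80.0%  71.8%  74.7%
-------------------------------------------------------------------------------
BERT_ALL            76.0%   77.4%     74.7%     1.04        80.1%  70.0%  74.0%
BERT_WIKICREM_ALL   78.0%   79.4%     76.7%     1.04        84.8%  70.0%  74.7%

Model               WB T1-a  WB T1-p  WB T2-a  WB T2-p  WINOGENDER  PDP
-------------------------------------------------------------------------
SOTA                60.6%    74.9%    78.0%    88.6%    50.9%       74.0%
-------------------------------------------------------------------------
BERT                61.3%    60.3%    76.2%    75.8%    59.2%       71.7%
BERT_WIKIRAND       53.5%    52.5%    64.6%    65.2%    57.9%       73.3%
BERT_WIKICREM       65.2%    64.9%    95.7%    94.9%    66.7%       76.7%
-------------------------------------------------------------------------
BERT_GAP            64.6%    63.8%    88.1%    87.9%    67.5%       85.0%
BERT_WIKICREM_GAP   71.2%    70.5%    97.2%    98.2%    75.4%       83.3%
-------------------------------------------------------------------------
BERT_DPR            78.0%    78.2%    85.6%    86.4%    79.2%       81.7%
BERT_GAP_DPR        77.8%    76.5%    89.6%    89.1%    75.8%       86.7%
BERT_WIKICREM_DPR   76.0%    76.38%   81.38%   80.3%    82.1%       76.7%
-------------------------------------------------------------------------
BERT_ALL            77.8%    77.2%    94.7%    94.9%    78.8%       81.7%
BERT_WIKICREM_ALL   76.8%    75.8%    98.7%    99.0%    76.7%       86.7%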


Table 1: Evaluation of trained models on all test sets. GAP and WINOBIAS (abbreviated WB) are additionally split
into subsets, as introduced in Section 5. Double lines in the table separate results from three different scenarios:
when no training data is available, when additional training data exists, and the transductive scenario. The table
is further split into sections separated with single horizontal lines. Each section contains a model that has been
trained on WIKICREM and models that have not been. The best result in each section is in bold. The best overall
result is underlined. Scores on GAP are measured as F1-scores, while the performance on other datasets is given
in accuracy. The source of each SOTA is listed in Section 5.
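
The Bias column of Table 1 can be checked against the two GAP subset columns. The sketch below assumes that Bias is the ratio of the F1-score on the feminine subset to the F1-score on the masculine subset, which is the convention of the GAP benchmark; the helper name `bias_ratio` is hypothetical, and the numbers are copied from a few rows of Table 1.

```python
def bias_ratio(f1_feminine: float, f1_masculine: float) -> float:
    """Ratio of feminine-subset F1 to masculine-subset F1 (1.0 means no gap)."""
    return f1_feminine / f1_masculine


# model -> (feminine F1, masculine F1, Bias value reported in Table 1)
gap_rows = {
    "SOTA": (71.4, 72.8, 0.98),
    "BERT": (47.2, 52.7, 0.90),
    "BERT_WIKICREM": (57.5, 60.5, 0.95),
    "BERT_GAP": (75.1, 75.3, 1.00),
    "BERT_WIKICREM_ALL": (79.4, 76.7, 1.04),
}

for model, (f1_f, f1_m, reported) in gap_rows.items():
    ratio = bias_ratio(f1_f, f1_m)
    print(f"{model}: computed {ratio:.2f}, reported {reported:.2f}")
```

For these rows the computed ratio agrees with the reported value after rounding to two decimals. Values below 1 indicate better performance on masculine pronouns, and values above 1 indicate better performance on feminine pronouns.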


a consistent improvement over BERT_GAP on all subsets of the
GAP test set. This confirms that WIKICREM works not just as
a standalone dataset, but also as additional pre-training in
the transductive scenario.


Similarly, BERT_WIKICREM_DPR outperforms BERT_DPR on the
majority of tasks, showing the applicability of WIKICREM to
the scenario where additional training data is available.
However, the good results of BERT_GAP_DPR show that
additional training on a manually constructed dataset, such
as GAP, can yield results similar to those of additional
training on WIKICREM. The reason behind this difference is
the impact of the data distribution. GAP, DPR, and WIKICREM
contain data that follows different distributions, which
strongly impacts the trained models. This
can be seen when we fine-tune BERT_GAP on DPR to obtain
BERT_GAP_DPR, as the model's F1-score on GAP-test drops by
5.2 percentage points (from 75.2% to 70.0% in Table 1).
WIKICREM's data distribution strongly differs from that of
the test sets, as described in Section 3.


However, the best results are achieved when all available
data is combined, as shown by the models BERT_ALL and
BERT_WIKICREM_ALL. BERT_WIKICREM_ALL achieves the highest
performance on GAP, DPR, WNLI, and WINOBIAS among the
models, and sets the new state-of-the-art result on GAP,
DPR, and WINOBIAS. The new state-of-the-art result on the
WINOGENDER dataset is achieved by the BERT_WIKICREM_DPR
model, while BERT_WIKICREM_ALL and BERT_GAP_DPR set the new
state-of-the-art on the PDP dataset.

7 Conclusions and Future Work


In this work, we introduced WIKICREM, a large dataset of
training instances for pronoun resolution. We used our
dataset to fine-tune the BERT language model. Our results
match or outperform state-of-the-art models on 6 out of 7
evaluated datasets.

The employed data-generating procedure can be further
applied to other large sources of text to generate more
training sets for pronoun resolution. In addition, both the
variety and the size of the generated datasets can be
increased if we do not restrict ourselves to personal names.
We hope that the community will make use of our released
WIKICREM dataset to further improve pronoun resolution.


Acknowledgments 


This work was supported by the Alan Turing Institute under
the UK EPSRC grant EP/N510129/1, by the EPSRC grant
EP/R013667/1, by the EPSRC studentship
OUCS/EPSRC-NPIF/VK/1123106, by the JP Morgan PhD Fellowship
2019-2020, and by an EPSRC Vacation Bursary. We also
acknowledge the use of the EPSRC-funded Tier 2 facility JADE
(EP/P020275/1).


References 


Ernest Davis, Leora Morgenstern, and Charles L. Or- 
tiz. 2017. The first Winograd Schema Challenge at 
IJCAI-16. AI Magazine, 38(3):97-98. 


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 
Kristina Toutanova. 2018. BERT: Pre-training 
of deep bidirectional transformers for language 
understanding. Computing Research Repository, 
arXiv:1810.04805. 


Greg Durrett and Dan Klein. 2013. Easy victories and 
uphill battles in coreference resolution. In Proceed- 
ings of the 2013 Conference on Empirical Methods 
in Natural Language Processing, pages 1971-1982, 
Seattle, Washington, USA. Association for Compu- 
tational Linguistics. 


Ali Emami, Noelia De La Cruz, Adam Trischler,
Kaheer Suleman, and Jackie Chi Kit Cheung.
2018. A knowledge hunting framework for common
sense reasoning. Computing Research Repository,
arXiv:1810.01375.


Ali Emami, Paul Trichelair, Adam Trischler, Kaheer 
Suleman, Hannes Schulz, and Jackie Chi Kit Che- 
ung. 2019. The KNOWREF Coreference Corpus: 
Removing gender and number cues for difficult 
pronominal anaphora resolution. In ACL 2019. 


Johannes Fähndrich, Sabine Weber, and Hannes Kanthak.
2018. A marker passing approach to Winograd
schemas. In Semantic Technology, pages 165-181,
Cham. Springer International Publishing.


Abbas Ghaddar and Philippe Langlais. 2016. WikiCoref:
An English coreference-annotated corpus of
Wikipedia articles. In Proceedings of the Tenth
International Conference on Language Resources and
Evaluation (LREC 2016), Portorož, Slovenia.
European Language Resources Association (ELRA).


Liane Guillou. 2012. Improving pronoun translation 
for statistical machine translation. In Proceedings of 
the Student Research Workshop at the 13th Confer- 
ence of the European Chapter of the Association for 
Computational Linguistics, pages 1-10. Association 
for Computational Linguistics. 


Vid Kocijan, Ana-Maria Cretu, Oana-Maria Cam- 
buru, Yordan Yordanov, and Thomas Lukasiewicz. 
2019. A surprisingly robust trick for Winograd 
Schema Challenge. Computing Research Reposi- 
tory, arXiv:1905.06290. 


Heeyoung Lee, Angel Chang, Yves Peirsman, 
Nathanael Chambers, Mihai Surdeanu, and Dan 
Jurafsky. 2013. Deterministic coreference resolu- 
tion based on entity-centric, precision-ranked rules. 
Computational Linguistics, 39(4):885-916. 


Kenton Lee, Luheng He, Mike Lewis, and Luke Zettle- 
moyer. 2017. End-to-end neural coreference reso- 
lution. In Proceedings of the 2017 Conference on 
Empirical Methods in Natural Language Process- 
ing, pages 188-197, Copenhagen, Denmark. Asso- 
ciation for Computational Linguistics. 


Hector J. Levesque, Ernest Davis, and Leora Mor- 
genstern. 2011. The Winograd Schema Challenge. 
AAAI Spring Symposium: Logical Formalizations of 
Commonsense Reasoning, 46. 


Fei Liu, Luke S. Zettlemoyer, and Jacob Eisenstein.
2019. The referential reader: A recurrent entity
network for anaphora resolution. Computing Research
Repository, arXiv:1902.01541.


Christopher D. Manning, Mihai Surdeanu, John Bauer, 
Jenny Finkel, Steven J. Bethard, and David Mc- 
Closky. 2014. The Stanford CoreNLP natural lan- 
guage processing toolkit. In Association for Compu- 
tational Linguistics (ACL) System Demonstrations, 
pages 55-60. 


Kotaro Nakayama. 2019. Wikipedia mining for triple 
extraction enhanced by co-reference resolution. 


Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015.
Solving hard coreference problems. In HLT-NAACL.




Sameer Pradhan, Alessandro Moschitti, Nianwen Xue,
Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012
Shared Task: Modeling multilingual unrestricted
coreference in OntoNotes. In Joint Conference
on EMNLP and CoNLL - Shared Task, CoNLL '12,
pages 1-40, Stroudsburg, PA, USA. Association
for Computational Linguistics.


Alec Radford, Jeff Wu, Rewon Child, David Luan, 
Dario Amodei, and Ilya Sutskever. 2019. Language 
models are unsupervised multitask learners. 


Altaf Rahman and Vincent Ng. 2012. Resolving com- 
plex cases of definite pronouns: The Winograd 
Schema Challenge. In Proceedings of EMNLP. 


Rachel Rudinger, Jason Naradowsky, Brian Leonard,
and Benjamin Van Durme. 2018. Gender bias in
coreference resolution. Computing Research
Repository, arXiv:1804.09301.


Ulrich Schäfer, Christian Spurk, and Jörg Steffen.
2012. A fully coreference-annotated corpus of
scholarly papers from the ACL Anthology. Proceedings
of COLING 2012: Posters, pages 1059-1070.


T. H. Trinh and Q. V. Le. 2018. A Simple Method
for Commonsense Reasoning. Computing Research
Repository, arXiv:1806.02847.


Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is
all you need. Computing Research Repository,
arXiv:1706.03762.


Alex Wang, Amanpreet Singh, Julian Michael, Felix 
Hill, Omer Levy, and Samuel R. Bowman. 2019. 
GLUE: A multi-task benchmark and analysis plat- 
form for natural language understanding. In the Pro- 
ceedings of ICLR. 


Kellie Webster, Marta Recasens, Vera Axelrod, and Ja- 
son Baldridge. 2018. Mind the GAP: A balanced 
corpus of gendered ambiguous pronouns. In Trans- 
actions of the ACL. 


Terry Winograd. 1972. Understanding Natural Lan- 
guage. Academic Press, Inc., Orlando, FL, USA. 


Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- 
donez, and Kai-Wei Chang. 2018. Gender bias in 
coreference resolution: Evaluation and debiasing 
methods. In Proceedings of the 2018 Conference 
of the North American Chapter of the Association 
for Computational Linguistics: Human Language 
Technologies, Volume 2 (Short Papers), pages 15- 
20, New Orleans, Louisiana. Association for Com- 
putational Linguistics. 




